[V1] Optimize block table transfer from CPU to GPU #11401

WoosukKwon · 2024-12-22T01:09:13Z

Currently, the block table transfer from CPU to GPU can be expensive because we send the entire block table ([batch_size, max_model_len // block_size]) every step. This PR reduces the overhead by sending only the diffs from CPU to GPU, which are typically very small.

The solution in this PR relies on CUDA unified virtual addressing (UVA), so it may not work in some environments. In such cases, we fall back to the original implementation (copying the entire block table tensor).
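The diff-based idea can be sketched in NumPy as a rough illustration, not as the PR's actual implementation. The class name `DiffTrackedBlockTable` and its methods are hypothetical; the per-row `[start, length]` diff layout mirrors `block_table_diff_np` in the diff under review, and the "GPU" side here is only a second NumPy array standing in for device memory.

```python
import numpy as np


class DiffTrackedBlockTable:
    """Hypothetical CPU-side block table that records appended ranges per row."""

    def __init__(self, max_num_reqs: int, max_num_blocks_per_req: int):
        shape = (max_num_reqs, max_num_blocks_per_req)
        self.cpu = np.zeros(shape, dtype=np.int32)
        # Stand-in for the device copy (assumption: a real version uses a GPU tensor).
        self.gpu = np.zeros(shape, dtype=np.int32)
        # Per row: [start column, number of new blocks] since the last sync.
        self.diff = np.zeros((max_num_reqs, 2), dtype=np.int32)
        self.num_blocks = np.zeros(max_num_reqs, dtype=np.int32)

    def append_row(self, row_idx: int, block_ids: list) -> None:
        start = int(self.num_blocks[row_idx])
        n = len(block_ids)
        self.cpu[row_idx, start:start + n] = block_ids
        if self.diff[row_idx, 1] == 0:
            # First change to this row since the last sync: remember where it began.
            self.diff[row_idx, 0] = start
        self.diff[row_idx, 1] += n
        self.num_blocks[row_idx] += n

    def apply_diff(self) -> int:
        """Copy only the changed ranges and return the number of elements copied."""
        copied = 0
        for row_idx in np.nonzero(self.diff[:, 1])[0]:
            start, length = (int(v) for v in self.diff[row_idx])
            self.gpu[row_idx, start:start + length] = self.cpu[row_idx, start:start + length]
            copied += length
        self.diff.fill(0)
        return copied


table = DiffTrackedBlockTable(max_num_reqs=4, max_num_blocks_per_req=8)
table.append_row(0, [10, 11, 12])  # prefill-like step: several new blocks
table.append_row(2, [20])
first = table.apply_diff()
table.append_row(0, [13])          # decode-like step: one new block
second = table.apply_diff()
```

In this toy run the second sync moves a single element, whereas copying the whole table would move all 32.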

Signed-off-by: Woosuk Kwon <[email protected]>

github-actions · 2024-12-22T01:09:25Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

youkaichao · 2024-12-23T05:24:03Z

csrc/prepare_inputs/copy_subranges.cu

+  int* d_matrix_tgt = matrix_tgt.data_ptr<int>();
+
+  // One thread block per row.
+  int blocks = n;


it seems this can easily oversubscribe GPU SMs.
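One common way to address this kind of concern, not necessarily what this PR adopted, is to cap the grid size and let each thread block stride over several rows. The Python sketch below only simulates the row-to-block assignment; the function name `rows_per_block` and the cap `max_blocks` are hypothetical.

```python
def rows_per_block(n_rows: int, max_blocks: int) -> list:
    """Assign rows to a capped number of thread blocks, grid-stride style."""
    grid = max(1, min(n_rows, max_blocks))
    # Block b handles rows b, b + grid, b + 2 * grid, ...
    return [list(range(block_id, n_rows, grid)) for block_id in range(grid)]


# Hypothetical sizes: 10 rows to process, at most 4 blocks launched.
assignment = rows_per_block(n_rows=10, max_blocks=4)
```

With these toy sizes, four blocks cover all ten rows instead of ten blocks covering one row each.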

youkaichao · 2024-12-23T05:25:21Z

csrc/prepare_inputs/copy_subranges.cu

+  int length = matrix_diff[row_id * 2 + 1];
+  int end = start + length;
+  int thread_idx = threadIdx.x;
+  for (int i = start + thread_idx; i < end; i += blockDim.x) {


most threads in the block would be idle, e.g. for decoding, only one entry (or even none) changes per row of the block table.
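To put a rough number on this, the sketch below estimates the fraction of threads that execute at least one iteration of the strided loop `for (i = start + threadIdx.x; i < end; i += blockDim.x)`. The block size of 128 is an assumption for illustration; the actual launch configuration is not shown in this hunk.

```python
def active_thread_fraction(lengths: list, threads_per_block: int) -> float:
    """Fraction of launched threads that copy at least one element.

    With one block per row and a block-strided loop, a row of length L
    keeps min(L, threads_per_block) threads busy.
    """
    if not lengths:
        return 0.0
    total = len(lengths) * threads_per_block
    active = sum(min(length, threads_per_block) for length in lengths)
    return active / total


# Decode-like step: each row gains one new block or none (hypothetical values).
decode_fraction = active_thread_fraction([1, 0, 1, 0], threads_per_block=128)
```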

youkaichao · 2024-12-23T05:49:20Z

vllm/v1/worker/gpu_block_table.py

+            self.block_table_diff_np[row_idx, 0] = start
+            # Move-and-append is not allowed.
+            assert self.block_table_diff_np[row_idx, 1] == 0
+            self.block_table_diff_np[row_idx, 1] = num_blocks


for the non-UVA case, we still need to keep track of the max block table length, so that apply_diff only needs to copy that many columns.

WoosukKwon · 2024-12-23 (edited)

Good point. The problem is that the memcpy API requires the data to be in a contiguous memory region: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79

So when the block table tensor has the shape [batch_size, max_model_len] and we slice over the second dimension, we have to call the memcpy API batch_size times instead of once.
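The contiguity issue can be seen with a small NumPy example, offered as an illustration under hypothetical sizes. The helper `num_contiguous_chunks` is not part of the PR; it simply counts how many flat copies a row-major 2D view would need, assuming each row of the view is itself contiguous.

```python
import numpy as np

batch_size, max_blocks = 4, 8  # hypothetical sizes
table = np.arange(batch_size * max_blocks, dtype=np.int32).reshape(batch_size, max_blocks)

full = table         # whole table: one contiguous region
rows = table[:2]     # leading rows: still one contiguous region
cols = table[:, :3]  # leading columns: strided, not contiguous


def num_contiguous_chunks(view: np.ndarray) -> int:
    """Flat copies needed for a 2D row-major view whose rows are contiguous."""
    return 1 if view.flags["C_CONTIGUOUS"] else view.shape[0]


chunks_full = num_contiguous_chunks(full)
chunks_rows = num_contiguous_chunks(rows)
chunks_cols = num_contiguous_chunks(cols)
```

Slicing rows keeps a single copy, while slicing columns needs one copy per row, which matches the batch_size calls described above.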

tlrmchlsmth · 2024-12-29T18:43:14Z

csrc/prepare_inputs/copy_subranges.cu

+  int end = start + length;
+  int thread_idx = threadIdx.x;
+  for (int i = start + thread_idx; i < end; i += blockDim.x) {
+    int idx = row_offset + i;


Should row_offset and idx be int64_t? I.e. could they overflow an int32?
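A quick back-of-the-envelope check of when a flat index exceeds the int32 range, assuming `row_offset` is computed as `row_id * num_cols` (its definition is not shown in this hunk). The table sizes below are hypothetical.

```python
INT32_MAX = 2**31 - 1


def max_flat_index(num_rows: int, num_cols: int) -> int:
    """Largest value of row_offset + i for a row-major [num_rows, num_cols] table."""
    return num_rows * num_cols - 1


def fits_int32(num_rows: int, num_cols: int) -> bool:
    return max_flat_index(num_rows, num_cols) <= INT32_MAX


# Hypothetical serving config: 1024 requests, 128K context, block size 16.
typical_ok = fits_int32(1024, 131072 // 16)
# Hypothetical extreme table with 2**32 entries.
extreme_ok = fits_int32(65536, 65536)
```

Under these assumed sizes the typical table has about 8.4 million entries, well within int32, while the extreme case would need 64-bit indices.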

wip

1aaced5

Signed-off-by: Woosuk Kwon <[email protected]>

mergify bot added the ci/build label Dec 22, 2024

WoosukKwon added 3 commits December 21, 2024 17:11

yapf

8a4180c

Signed-off-by: Woosuk Kwon <[email protected]>

Minor

03b1e6f

Signed-off-by: Woosuk Kwon <[email protected]>

Minor

0a669ee

Signed-off-by: Woosuk Kwon <[email protected]>

youkaichao reviewed Dec 23, 2024

WoosukKwon added 9 commits December 22, 2024 22:16

Use default

ee965c9

Signed-off-by: Woosuk Kwon <[email protected]>

Merge branch 'main' into v1-blocktable-opt

0420fb2

comments

3fdbd8e

Signed-off-by: Woosuk Kwon <[email protected]>

Merge branch 'main' into v1-blocktable-opt

b938606

Merge branch 'main' into v1-blocktable-opt

ff5b103

Minor

bef6816

Signed-off-by: Woosuk Kwon <[email protected]>

Add test for uva

5292219

Signed-off-by: Woosuk Kwon <[email protected]>

minor

ca4f9e6

Signed-off-by: Woosuk Kwon <[email protected]>

Add kernel test

27e8eb2

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon marked this pull request as ready for review December 26, 2024 20:01

WoosukKwon requested review from tlrmchlsmth, robertgshaw2-neuralmagic, njhill, ywang96, comaniac and alexm-neuralmagic as code owners December 26, 2024 20:01

WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) Dec 26, 2024

WoosukKwon added 3 commits December 26, 2024 18:52

Merge branch 'main' into v1-blocktable-opt

34d6cc2

Minor

6ba31aa

Signed-off-by: Woosuk Kwon <[email protected]>

ruff

ebfbe12

Signed-off-by: Woosuk Kwon <[email protected]>

tlrmchlsmth reviewed Dec 29, 2024

WoosukKwon marked this pull request as draft December 31, 2024 05:37

WoosukKwon added 2 commits January 1, 2025 03:10

Merge branch 'main' into v1-blocktable-opt

a6e5d7b

Minor

1260e43

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon mentioned this pull request Jan 2, 2025

[V1] Add BlockTable class #11693

Merged
